This is one page of the R Handbook for Epidemiologists, but is being printed as a stand-alone page.
You can find the complete handbook on GitHub.
This page covers:
Load packages
One can use the R package DiagrammeR to create charts/flow charts. They can be static, or they can adjust somewhat dynamically based on changes in a dataset.
Tools
The function grViz() is used to create a “Graphviz” diagram. This function accepts a character string input containing instructions for making the diagram. Within that string, the instructions are written in a different language, called DOT - it is quite easy to learn the basics.
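Because grViz() takes an ordinary character string, the DOT instructions can be assembled with base R string functions before they are rendered. The sketch below is illustrative only: the object names dot_nodes, dot_edges, and dot_string are invented for this example, and the grViz() call is skipped if DiagrammeR is not installed.

```r
# Node names and edges to include (hypothetical example values)
dot_nodes <- c("a", "b", "c")
dot_edges <- "a -> b -> c"

# Assemble the DOT instructions as one character string
dot_string <- paste0(
  "digraph my_flow_chart {\n",
  "  graph [layout = dot, rankdir = LR]\n",
  "  ", paste(dot_nodes, collapse = "; "), "\n",
  "  ", dot_edges, "\n",
  "}"
)

# Inspect the instructions that will be passed to grViz()
cat(dot_string, "\n")

# Render only if DiagrammeR is available
if (requireNamespace("DiagrammeR", quietly = TRUE)) {
  flow_chart <- DiagrammeR::grViz(dot_string)
  flow_chart   # printing the object displays the diagram
}
```

Building the string separately makes it easy to print and check the DOT instructions when a diagram does not render as expected.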
Basic structure
grViz("digraph my_flow_chart {}")
Below are two simple examples.
A very minimal example:
# A minimal plot
DiagrammeR::grViz("digraph {
graph[layout = dot, rankdir = LR]
a
b
c
a -> b -> c
}")
An example with applied public health context:
grViz(" # All instructions are within a large character string
digraph surveillance_diagram { # 'digraph' means 'directional graph', then the graph name
# graph statement
#################
graph [layout = dot,
rankdir = TB,
overlap = true,
fontsize = 10]
# nodes
#######
node [shape = circle, # shape = circle
fixedsize = true
width = 1.3] # width of circles
Primary # names of nodes
Secondary
Tertiary
# edges
#######
Primary -> Secondary [label = 'case transfer']
Secondary -> Tertiary [label = 'case transfer']
}
")
Basic syntax
Node names, or edge statements, can be separated with spaces, semicolons, or newlines.
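As a rough illustration of the separator rule, the three DOT strings below describe the same graph and differ only in whether statements are separated by spaces, semicolons, or newlines. The object names are invented for this sketch, and rendering is attempted only if DiagrammeR is installed.

```r
# Same nodes and edges, three different separators (illustrative strings)
dot_spaces     <- "digraph { a b c a -> b b -> c }"
dot_semicolons <- "digraph { a; b; c; a -> b; b -> c }"
dot_newlines   <- "digraph {\n a\n b\n c\n a -> b\n b -> c\n}"

# Render each version only if DiagrammeR is available
if (requireNamespace("DiagrammeR", quietly = TRUE)) {
  diagrams <- lapply(
    list(dot_spaces, dot_semicolons, dot_newlines),
    DiagrammeR::grViz
  )
}
```

Newlines are usually the most readable choice for longer diagrams, because each statement can then carry its own comment.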
Rank direction
A plot can be re-oriented to move left-to-right by adjusting the rankdir argument within the graph statement. The default is TB (top-to-bottom), but it can be LR (left-to-right), RL, or BT.
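One way to experiment with the four directions is to generate the DOT string from a small helper. This is a sketch under assumptions: make_dot() is a hypothetical function written for this example, not part of DiagrammeR, and the diagram is rendered only if DiagrammeR is installed.

```r
# Hypothetical helper: build a three-node DOT string for a given rank direction
make_dot <- function(rankdir = c("TB", "LR", "RL", "BT")) {
  rankdir <- match.arg(rankdir)   # errors on any value other than the four listed
  sprintf("digraph { graph [layout = dot, rankdir = %s] a -> b -> c }", rankdir)
}

dot_top_bottom <- make_dot()       # default: top-to-bottom
dot_left_right <- make_dot("LR")   # left-to-right

# Render the left-to-right version only if DiagrammeR is available
if (requireNamespace("DiagrammeR", quietly = TRUE)) {
  DiagrammeR::grViz(dot_left_right)
}
```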
Node names
Node names can be single words, as in the simple example above. To use multi-word names or special characters (e.g. parentheses, dashes), put the node name within single quotes (' '). It may be easier to use a short node name and assign a label within brackets [ ], as shown below. A label is also needed to put a newline in the displayed node text: use \n in the node label within single quotes, as shown below.
Subgroups
Within edge statements, subgroups can be created on either side of the edge with curly brackets ({ }). The edge then applies to all nodes in the bracket - it is a shorthand.
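To make the shorthand concrete, the base R sketch below expands a grouped edge into the individual edges it stands for. expand_group_edge() is a hypothetical helper written only for this illustration; DOT does this expansion itself, so the helper is not needed when drawing a diagram.

```r
# Hypothetical helper: expand a grouped edge into the individual edges it stands for
expand_group_edge <- function(from, to) {
  pairs <- expand.grid(from = from, to = to, stringsAsFactors = FALSE)
  paste(pairs$from, "->", pairs$to)
}

# Shorthand form, as it would appear in the DOT string
shorthand <- "{Primary Secondary Tertiary} -> SC"

# Equivalent long form: one edge statement per node in the brackets
long_form <- expand_group_edge(c("Primary", "Secondary", "Tertiary"), "SC")
long_form
```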
Layouts
With layout = dot, set rankdir to either TB, LR, RL, or BT.
Nodes - editable attributes
label (text, in single quotes if multi-word)
fillcolor (many possible colors)
fontcolor
alpha (transparency 0-1)
shape (ellipse, oval, diamond, egg, plaintext, point, square, triangle)
style
sides
peripheries
fixedsize (h x w)
height
width
distortion
penwidth (width of shape border)
x (displacement left/right)
y (displacement up/down)
fontname
fontsize
icon
Edges - editable attributes
arrowsize
arrowhead (normal, box, crow, curve, diamond, dot, inv, none, tee, vee)
arrowtail
dir (direction)
style (dashed, …)
color
alpha
headport (text in front of arrowhead)
tailport (text behind arrowtail)
fontname
fontsize
fontcolor
penwidth (width of arrow)
minlen (minimum length)
Color names: hexadecimal values or ‘X11’ color names, see here for X11 details
The example below expands on the surveillance_diagram, adding complex node names, grouped edges, colors, and styling:
grViz(" # All instructions are within a large character string
digraph surveillance_diagram { # 'digraph' means 'directional graph', then the graph name
# graph statement
#################
graph [layout = dot,
rankdir = TB, # layout top-to-bottom
fontsize = 10]
# nodes (circles)
#################
node [shape = circle, # shape = circle
fixedsize = true
width = 1.3]
Primary [label = 'Primary\nFacility']
Secondary [label = 'Secondary\nFacility']
Tertiary [label = 'Tertiary\nFacility']
SC [label = 'Surveillance\nCoordination',
fontcolor = darkgreen]
# edges
#######
Primary -> Secondary [label = 'case transfer',
fontcolor = red,
color = red]
Secondary -> Tertiary [label = 'case transfer',
fontcolor = red,
color = red]
# grouped edge
{Primary Secondary Tertiary} -> SC [label = 'case reporting',
fontcolor = darkgreen,
color = darkgreen,
style = dashed]
}
")
Sub-graph clusters
To group nodes into boxed clusters, put them within the same named subgraph (subgraph name {}). To have the subgraph identified within a box, begin the name with “cluster” as shown below.
grViz(" # All instructions are within a large character string
digraph surveillance_diagram { # 'digraph' means 'directional graph', then the graph name
# graph statement
#################
graph [layout = dot,
rankdir = TB,
overlap = true,
fontsize = 10]
# nodes (circles)
#################
node [shape = circle, # shape = circle
fixedsize = true
width = 1.3] # width of circles
subgraph cluster_passive {
Primary [label = 'Primary\nFacility']
Secondary [label = 'Secondary\nFacility']
Tertiary [label = 'Tertiary\nFacility']
SC [label = 'Surveillance\nCoordination',
fontcolor = darkgreen]
}
# nodes (boxes)
###############
node [shape = box, # node shape
fontname = Helvetica] # text font in node
subgraph cluster_active {
Active [label = 'Active\nSurveillance'];
HCF_active [label = 'HCF\nActive Search']
}
subgraph cluster_EBS {
EBS [label = 'Event-Based\nSurveillance (EBS)'];
'Social Media'
Radio
}
subgraph cluster_CBS {
CBS [label = 'Community-Based\nSurveillance (CBS)'];
RECOs
}
# edges
#######
{Primary Secondary Tertiary} -> SC [label = 'case reporting']
Primary -> Secondary [label = 'case transfer',
fontcolor = red]
Secondary -> Tertiary [label = 'case transfer',
fontcolor = red]
HCF_active -> Active
{'Social Media'; Radio} -> EBS
RECOs -> CBS
}
")
Node shapes
The example below, borrowed from this tutorial, shows applied node shapes, and shows a shorthand for serial edge connections
DiagrammeR::grViz("digraph {
graph [layout = dot, rankdir = LR]
# define the global styles of the nodes. We can override these in box if we wish
node [shape = rectangle, style = filled, fillcolor = Linen]
data1 [label = 'Dataset 1', shape = folder, fillcolor = Beige]
data2 [label = 'Dataset 2', shape = folder, fillcolor = Beige]
process [label = 'Process \n Data']
statistical [label = 'Statistical \n Analysis']
results [label= 'Results']
# edge definitions with the node IDs
{data1 data2} -> process -> statistical -> results
}")
“Parameterized figures: A great benefit of designing figures within R is that we are able to connect the figures directly with our analysis by reading R values directly into our flowcharts. For example, suppose you have created a filtering process which removes values after each stage of a process, you can have a figure show the number of values left in the dataset after each stage of your process. To do this, you can use the @@X symbol directly within the figure, then refer to this in the footer of the plot using [X]:, where X is a unique numeric index. Here is a basic example:”
https://mikeyharper.uk/flowcharts-in-r-using-diagrammer/
# Define some sample data
data <- list(a=1000, b=800, c=600, d=400)
DiagrammeR::grViz("
digraph graph2 {
graph [layout = dot]
# node definitions with substituted label text
node [shape = rectangle, width = 4, fillcolor = Beige]
a [label = '@@1']
b [label = '@@2']
c [label = '@@3']
d [label = '@@4']
a -> b -> c -> d
}
[1]: paste0('Raw Data (n = ', data$a, ')')
[2]: paste0('Remove Errors (n = ', data$b, ')')
[3]: paste0('Identify Potential Customers (n = ', data$c, ')')
[4]: paste0('Select Top Priorities (n = ', data$d, ')')
")
Much of the above is adapted from the tutorial at this site.
A more in-depth tutorial: http://rich-iannone.github.io/DiagrammeR/
Note that the tutorial above is out of date with respect to the current DiagrammeR package.
Plotting the connections in a dataset
https://www.r-graph-gallery.com/321-introduction-to-interactive-sankey-diagram-2.html
First, count cases by hospital and age category, then rename the columns: hospital becomes source and age category becomes target.
# counts by hospital and age category
links <- linelist %>%
select(hospital, age_cat) %>%
count(hospital, age_cat) %>%
rename(source = hospital,
target = age_cat)
Now formalize the nodes list, and adjust the ID columns to be numbers instead of labels:
# The unique node names
nodes <- data.frame(
name=c(as.character(links$source), as.character(links$target)) %>%
unique()
)
# match to numbers, not names
links$IDsource <- match(links$source, nodes$name)-1
links$IDtarget <- match(links$target, nodes$name)-1
Now plot the Sankey diagram:
# plot
######
p <- sankeyNetwork(Links = links,
Nodes = nodes,
Source = "IDsource",
Target = "IDtarget",
Value = "n",
NodeID = "name",
units = "cases", # unit label for the counts in column n
fontSize = 12,
nodeWidth = 30)
p
Here is an example where the patient outcome is included as well. Note in the data management step how we bind rows of counts of hospital -> outcome, using the same column names.
# counts by age category and hospital, then by hospital and outcome
links <- linelist %>%
select(hospital, age_cat) %>%
mutate(age_cat = stringr::str_glue("Age {age_cat}")) %>%
count(hospital, age_cat) %>%
rename(source = age_cat,
target = hospital) %>%
bind_rows(
linelist %>%
select(hospital, outcome) %>%
count(hospital, outcome) %>%
rename(source = hospital,
target = outcome)
)
# The unique node names
nodes <- data.frame(
name=c(as.character(links$source), as.character(links$target)) %>%
unique()
)
# match to numbers, not names
links$IDsource <- match(links$source, nodes$name)-1
links$IDtarget <- match(links$target, nodes$name)-1
# plot
######
p <- sankeyNetwork(Links = links,
Nodes = nodes,
Source = "IDsource",
Target = "IDtarget",
Value = "n",
NodeID = "name",
units = "cases", # unit label for the counts in column n
fontSize = 12,
nodeWidth = 30)
p
https://www.displayr.com/sankey-diagrams-r/
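Both Sankey examples above subtract 1 after match() because sankeyNetwork() expects the source and target columns to hold zero-based positions in the nodes data frame, whereas R indexes from 1. The sketch below checks that step on a small made-up links table, using only base R. The toy_links and toy_nodes objects and their counts are invented for illustration and do not come from linelist.

```r
# Made-up counts of cases by hospital and age category (illustration only)
toy_links <- data.frame(
  source = c("Hospital A", "Hospital A", "Hospital B"),
  target = c("0-4", "5-9", "0-4"),
  n      = c(12, 7, 9),
  stringsAsFactors = FALSE
)

# Unique node names: sources first, then targets
toy_nodes <- data.frame(
  name = unique(c(toy_links$source, toy_links$target)),
  stringsAsFactors = FALSE
)

# match() returns 1-based positions; subtract 1 for zero-based IDs
toy_links$IDsource <- match(toy_links$source, toy_nodes$name) - 1
toy_links$IDtarget <- match(toy_links$target, toy_nodes$name) - 1

toy_links
```

If any ID is NA or negative, a source or target value is missing from the nodes table, which is worth checking before plotting.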
Timeline Sankey - LTFU from cohort… application/rejections… etc.
E.g. border closures during COVID